Names: [Insert Your Names Here]

Lab 9 - Data Investigation 1 (Week 1) - Educational Research Data

Lab 9 Contents

  1. Background Information
    • Intro to the Second Half of the Class
    • Intro to Dataset 1: The Quantitative Reasoning for College Science Assessment
  2. Investigating Tabular Data with Pandas
    • Reading in and Cleaning Data
    • The describe() Method
    • Computing Descriptive Statistics
    • Creating Statistical Graphics
    • Selecting a Subset of Data
  3. Testing Differences Between Datasets
    • Computing Confidence Intervals
    • Visualizing Differences with Overlapping Plots
  4. Data Investigation 1 - Week 2 Instructions

In [ ]:
#various things that we will need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as st

1. Background Information

1.1 Introduction to the Second Half of the Class

The remainder of this course will be divided into three two week modules, each dealing with a different dataset. During the first week of each module, you will complete a (two class) lab in which you are introduced to the dataset and various techniques that you need to use to explore it.

At the end of Week 1, you and your lab partner will write a brief (1 paragraph) proposal to Professor Follette detailing an investigation that you would like to complete using that dataset in Week 2. You and your partner will complete this investigation and write it up as your lab the following week. Detailed instructions for submitting your proposal are at the end of this lab. Detailed instructions for the lab writeups will be provided next week.

1.2. Introduction to the QuaRCS Dataset

The Quantitative Reasoning for College Science (QuaRCS) assessment is an assessment instrument that Professor Follette has been administering in general education science classes across the country since 2012. It consists of 25 quantitative questions involving "real world" mathematical skills plus 24 attitudinal and demographic questions. It has been administered to more than 5000 students at eleven institutions. You will be reading the published results of this study for class on Thursday, and exploring the data in class this week and next.

A description of all of the variables (pandas dataframe columns) in the QuaRCS dataset and what each numerical answer choice "stands for" is in the file QuaRCS_descriptions.pdf.

2. Investigating Tabular Data with Pandas

2.1 Reading In and Cleaning Data


In [ ]:
# these set the pandas defaults so that it will print ALL values, even for very long lists and large dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Read in the QuaRCS data as a pandas dataframe called "data".


In [ ]:
data=pd.read_csv('AST200_data_anonymized.csv', encoding="ISO-8859-1")
# the survey uses 999 as a sentinel for blank answers; replace it with NaN
data = data.replace(999, np.nan)

Once a dataset has been read in as a pandas dataframe, several useful built-in pandas methods become available to us. Recall that you call methods with data.method(). Check out each of the following:


In [ ]:
# the * unpacks the column list so that every name prints, without the ... truncation
print(*data.columns)

In [ ]:
data.dtypes

2.2 The describe() method

There are also a whole bunch of built-in methods that become available once you've defined a pandas dataframe. To see a full list, type data. in an empty cell and then hit Tab.

An especially useful one is the describe() method, which creates a summary table with some common statistics for all of the columns in the dataframe.

In our case here there are a number of NaNs in our table (cases where an answer was left blank), and the describe method ignores them for mean, standard deviation (std), min and max. However, there is a known bug in the pandas module that causes NaNs to break the quartiles in the describe method, so these will always be NaN for any column that has a NaN anywhere in it, rendering them mostly useless here. Still, this is a nice quick way to get descriptive statistics for a table.


In [ ]:
data.describe()

2.3. Computing Descriptive Statistics

You can also of course compute descriptive statistics for columns in a pandas dataframe individually. Examples of each one applied to a single column (student scores on the assessment, PRE_SCORE) are shown below.


In [ ]:
np.mean(data["PRE_SCORE"])

In [ ]:
#or
data["PRE_SCORE"].mean()

In [ ]:
np.nanmedian(data["PRE_SCORE"])

In [ ]:
#or
data["PRE_SCORE"].median()

In [ ]:
data["PRE_SCORE"].max()

In [ ]:
data["PRE_SCORE"].min()

In [ ]:
data["PRE_SCORE"].mode() 
#the first number is the index (should be zero unless the column has multiple dimensions)
#and the second number is the mode
#not super useful for continuous variables: if you put in a continuous variable (like ZPR_1),
#you won't get a meaningful answer because there are few or no repeated values

In [ ]:
#perhaps equally useful is the value_counts method, which will tell you how many times each value appears in the column
data["PRE_SCORE"].value_counts()

In [ ]:
#and to count all of the non-null (non-NaN) values
data["PRE_SCORE"].count()

In [ ]:
#generally different from len(dataframe["column name"]) because len will count NaNs
#but the PRE_SCORE column has no NaNs, so swap this cell and the one before it
#with a column that does have NaNs to verify
len(data["PRE_SCORE"])

In [ ]:
#standard deviation
data["PRE_SCORE"].std()

In [ ]:
#variance
data["PRE_SCORE"].var()

In [ ]:
#verify relationship between variance and standard deviation
np.sqrt(data["PRE_SCORE"].var())

In [ ]:
#quantiles
data["PRE_SCORE"].quantile(0.5) # should return the median!

In [ ]:
data["PRE_SCORE"].quantile(0.25)

In [ ]:
data["PRE_SCORE"].quantile(0.75)

In [ ]:
#interquartile range
data["PRE_SCORE"].quantile(0.75)-data["PRE_SCORE"].quantile(0.25)

In [ ]:
data["PRE_SCORE"].skew()

In [ ]:
data["PRE_SCORE"].kurtosis()

Exercise 1


Choose one categorical (answer to any demographic or attitudinal question) and one continuous variable (e.g. PRE_TIME, ZPR_1) and compute all of the statistics from the list above in one code cell (use print statements) for each variable. Write a paragraph describing all of the statistics that are informative for that variable in words. An example is given below for PRE_SCORE. Because score is numerical and discrete, all of the statistics above are informative. In your two cases, fewer statistics will be informative, so your explanations may be shorter, though you should challenge yourselves to go beyond merely reporting the statistics, and should interpret them as well, as below.

QuaRCS score can take discrete integer values between 0 and 25. The minimum score for this dataset is 1 and the maximum is 25. There are 2,777 valid entries for score in this QuaRCS dataset, for which the mean is 13.9 and the median is 14 (both 56% of the maximum score). These are very close together, suggesting a reasonably centrally-concentrated score distribution, and the low skewness value of 0.1 supports this. The kurtosis of the distribution is negative (platykurtic), which tells us that the distribution of scores is flat rather than peaked. The most common score ("mode") is 10, with 197 (~7%) of participants getting this score; however, all score values from 7-21 have counts greater than 100, supporting the flat nature of the distribution suggested by the negative kurtosis. The interquartile range (25th-75th percentiles) is 8 points, and the standard deviation is 5.3. These represent a large fraction (32% and 21%, respectively) of the entire available score range, making the distribution quite wide.
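As a minimal, self-contained illustration of computing all of these statistics in one cell with print statements, here is the same set of pandas calls applied to a small made-up series (the values below are hypothetical stand-ins, not real QuaRCS scores):

```python
import pandas as pd

# hypothetical stand-in for a score column (NOT real QuaRCS data)
scores = pd.Series([10, 12, 14, 14, 17, 21, 7, 10, 25, 1])

print("mean:    ", scores.mean())
print("median:  ", scores.median())
print("min/max: ", scores.min(), scores.max())
print("mode(s): ", list(scores.mode()))
print("std:     ", scores.std())
print("variance:", scores.var())
print("IQR:     ", scores.quantile(0.75) - scores.quantile(0.25))
print("skew:    ", scores.skew())
print("kurtosis:", scores.kurtosis())
print("count:   ", scores.count())
```

For your own variables, swap the toy series for the relevant column of the data dataframe.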

Your description of categorical distribution here

Your description of continuous distribution here


In [ ]:
#your code computing all descriptive statistics for your categorical variable here

In [ ]:
#your code computing all descriptive statistics for your continuous variable here

2.4. Creating Statistical Graphics

Exercise 2 - Summary plots for distributions

Warning: Although you will be using QuaRCS data to investigate and experiment with each type of plot below, when you write up your descriptions, they should refer to the general properties of the plots, and not to the QuaRCS data specifically. In other words, your descriptions should be general descriptions of the plot types that could be applied to any dataset.

2a - Histogram

The syntax for creating a histogram for a pandas dataframe column is:

dataframe["Column Name"].hist(bins=nbins)

Play around with the column name and bins and refer to the docstring as needed until you understand thoroughly what is being shown. Describe what this type of plot (not any individual plot that you've made) shows in words and describe when you think it might be useful.

Play around with inputs (e.g. column name) until you find a case (dataframe column) where you think the histogram tells you something important and use it as an example to inform your answer. Inputs that do not produce informative histograms should also help to inform your answer. Save a couple of representative histograms (good and bad, use plt.savefig("figure name")) and integrate them into your written (markdown) explanation to support your argument.
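To see the syntax in action outside the QuaRCS data, here is a minimal sketch on a made-up column (the dataframe, column name, and output file name are all hypothetical):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a plain script
import matplotlib.pyplot as plt

# hypothetical toy dataframe standing in for the real one
rng = np.random.default_rng(0)
toy = pd.DataFrame({"SCORE": rng.integers(0, 26, size=200)})

ax = toy["SCORE"].hist(bins=13)   # same syntax as above
ax.set_xlabel("Score")
ax.set_ylabel("Count")
plt.savefig("toy_histogram.png")  # save so the figure can be embedded in markdown
```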


In [ ]:
#this cell is for playing around with histograms

Your explanation here, with figures

2b - Box plot

The syntax for creating a box plot for a pair of pandas dataframe columns is:

dataframe.boxplot(column="column name 1", by="column name 2")

Play around with the column and by variables and refer to the docstring as needed until you understand thoroughly what is being shown. Describe what this type of plot (not any individual plot that you've made) shows in words and describe when you think it might be useful.

Play around with inputs (e.g. column names) until you find a case that you think is well-described by a box and whisker plot and use it as an example to inform your answer. Inputs that do not produce informative box plots should also help to inform your answer. Save a couple of representative box plots (good and bad) and integrate them into your written explanation.
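A minimal sketch of the boxplot syntax on a made-up dataframe (column names and values are hypothetical): one numeric column split into groups by one categorical column.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a plain script
import matplotlib.pyplot as plt

# hypothetical toy dataframe: numeric SCORE grouped by categorical GROUP
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "SCORE": rng.integers(0, 26, size=120),
    "GROUP": rng.choice([1, 2, 3], size=120),
})

toy.boxplot(column="SCORE", by="GROUP")  # same syntax as above
plt.savefig("toy_boxplot.png")
```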


In [ ]:
#your sample boxplot code here

Your explanation here

2c - Pie Chart

The format for making the kind of pie chart that might be useful in this context is as follows:
newdataframe = dataframe["column name"].value_counts()
newdataframe.plot.pie(figsize=(6,6))

Play around with the column and refer to the docstring as needed until you understand thoroughly what is being shown. Describe what this type of plot (not any individual plot that you've made) shows in words and describe when you think it might be useful. In your explanation here, focus on how a pie chart compares to a histogram, and when you think one or the other might be useful.

Play around with inputs (e.g. column names) until you find a case that you think is well-described by a pie chart and use it as an example to inform your answer. Inputs that do not produce informative pie charts should also help to inform your answer. Save a couple of representative pie charts (good and bad) and integrate them into your written explanation.
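A minimal sketch of the pie-chart recipe on a made-up categorical column (the dataframe and column name are hypothetical): value_counts() produces one count per unique answer, and plot.pie draws one wedge per count.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a plain script
import matplotlib.pyplot as plt

# hypothetical toy dataframe with one categorical column
rng = np.random.default_rng(2)
toy = pd.DataFrame({"ANSWER": rng.choice(["A", "B", "C", "D"], size=100)})

counts = toy["ANSWER"].value_counts()   # one entry per unique answer
counts.plot.pie(figsize=(6, 6))
plt.savefig("toy_pie.png")
```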


In [ ]:
#your sample pie chart code here

Your explanation here

2d - Scatter Plot

The syntax for creating a scatter plot is:

dataframe.plot.scatter(x='column name',y='column name')

Play around with the column and refer to the docstring as needed until you understand thoroughly what is being shown. Describe what this type of plot (not any individual plot that you've made) shows in words and describe when you think it might be useful.

Play around with inputs (e.g. column names) until you find a case that you think is well-described by a scatter plot and use it as an example to inform your answer. Inputs that do not produce informative scatter plots should also help to inform your answer. Save a couple of representative scatter plots (good and bad) and integrate them into your written explanation.
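A minimal sketch of the scatter-plot syntax on two made-up, loosely correlated columns (the dataframe and column names are hypothetical):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a plain script
import matplotlib.pyplot as plt

# two loosely correlated hypothetical columns
rng = np.random.default_rng(3)
x = rng.normal(size=150)
toy = pd.DataFrame({"VAR1": x, "VAR2": 2 * x + rng.normal(size=150)})

toy.plot.scatter(x="VAR1", y="VAR2")  # same syntax as above
plt.savefig("toy_scatter.png")
```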


In [ ]:
#your sample scatter plot code here

Your explanation here

2.5. Selecting a Subset of Data

Exercise 3


Write a function called "filter" that takes a dataframe, column name, and value for that column as input and returns a new dataframe containing only those rows where column name = value. For example, filter(data, "PRE_GENDER", 1) should return a dataframe about half the size of the original dataframe where all values in the PRE_GENDER column are 1. (Note that defining this will shadow Python's built-in filter function within your notebook.)
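One possible approach is boolean indexing; the sketch below demonstrates it on a tiny hypothetical dataframe (the toy values are made up, but the function works the same way on the real QuaRCS dataframe):

```python
import pandas as pd

def filter(dataframe, column, value):
    """Return a new dataframe with only the rows where dataframe[column] == value."""
    return dataframe[dataframe[column] == value]

# quick check on a toy dataframe (hypothetical values)
toy = pd.DataFrame({"PRE_GENDER": [1, 2, 1, 2, 1],
                    "PRE_SCORE":  [10, 12, 14, 9, 20]})
subset = filter(toy, "PRE_GENDER", 1)
print(len(subset))   # 3 rows match
```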


In [ ]:
#your function here

In [ ]:
#your tests here

If you get to this point during lab time on Tuesday, stop here

3. Testing Differences Between Datasets

3.1 Computing Confidence Intervals

Now that we have a mechanism for filtering the dataset, we can test differences between groups with confidence intervals. The syntax for computing the confidence interval on a mean for a given variable is as follows.

variable1 = st.t.interval(conf_level,n,loc=np.nanmean(variable2), scale=st.sem(variable2))

where conf_level is the confidence level you wish to calculate (e.g. 0.95 is 95% confidence, 0.98 is 98%, etc.) and n is the number of degrees of freedom, which should generally be set to the number of valid entries in variable2 minus 1.

An example can be found below.


In [ ]:
## apply filter to select only men from data, and pull the scores from this group into a variable
df2=filter(data,'PRE_GENDER',1)
men_scores=df2['PRE_SCORE']

In [ ]:
#compute 95% confidence intervals on the mean (low and high)
men_conf=st.t.interval(0.95, len(men_scores)-1, loc=np.mean(men_scores), scale=st.sem(men_scores))
men_conf
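The cell above needs the QuaRCS file (and the filter function) to run; as a self-contained sketch of the same st.t.interval call, here it is on made-up scores drawn from a normal distribution:

```python
import numpy as np
import scipy.stats as st

# hypothetical scores (NOT real data), roughly centered at 14 with spread 5
rng = np.random.default_rng(4)
toy_scores = rng.normal(loc=14, scale=5, size=500)

ci = st.t.interval(0.95, len(toy_scores) - 1,
                   loc=np.mean(toy_scores), scale=st.sem(toy_scores))
print(ci)  # (low, high) interval bracketing the sample mean
```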

Exercise 4


Choose a categorical variable (any demographic or attitudinal variable) that you find interesting and that has at least four possible values and calculate the confidence intervals on the mean score for each group. Then write a paragraph describing the results. Are the differences between the groups significant according to your data? Would they still be significant if you were to compute the 98% confidence intervals?


In [ ]:
#code to filter data and compute confidence intervals for each answer choice

explanatory text

3.2 Visualizing Differences with Overlapping Plots

Exercise 5


Make another dataframe consisting only of students who "devoted effort" to the assessment, meaning their answer for PRE_EFFORT was EITHER a 4 or a 5 (you may have to modify your filter function to accept more than one value for "value").

Make overlapping histograms showing (a) scores for the entire student population and (b) scores for this "high effort" subset. The "alpha" keyword inside the plot commands will set the transparency of your histogram so that you can see both. Play around with it until it looks good. Make sure your chart includes a legend, and describe what conclusions you can draw from the result in a paragraph below the final chart.
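The alpha keyword can be sketched on made-up data as follows (both distributions below are hypothetical stand-ins for the full-population and high-effort scores):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a plain script
import matplotlib.pyplot as plt

# hypothetical score distributions for a full population and a subset
rng = np.random.default_rng(5)
all_scores = rng.normal(13, 5, size=1000)
subset_scores = rng.normal(16, 4, size=400)

plt.hist(all_scores, bins=25, alpha=0.5, label="all students")      # alpha=0.5 -> 50% transparent
plt.hist(subset_scores, bins=25, alpha=0.5, label="high effort")    # so both layers stay visible
plt.legend()
plt.savefig("toy_overlap.png")
```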


In [ ]:
#modified filter function here

In [ ]:
#define your new high effort dataframe using the filter

In [ ]:
#plot two overlapping histograms

explanatory text here

4. Data Investigation - Week 2 Instructions

Now that you are familiar with the QuaRCS dataset, you and your partner must come up with an investigation that you would like to complete using this data. For the next two modules, this will be more open, but for this first investigation, I will suggest the following three options, of which each group will need to pick one (we will divide in class):

  • Design visualizations that compare student attitudes pre and post-semester
  • Design visualizations that compare student skills (by topical area) pre and post semester
  • Design visualizations that compare students' awareness of their own skills pre and post semester

Before 5pm next Monday evening (3/27), you must send Professor Follette a brief e-mail (that you write together, one e-mail per group) describing a plan for how you will approach the problem you've been assigned. What do you need to know that you don't know already? What kind of plots will you make and what kinds of statistics will you compute? What is your first thought for what your final data representations will look like (histograms? box and whisker plots? overlapping plots or side by side?).


In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../custom.css", "r").read()
    return HTML(styles)
css_styling()

